Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Authors
Abstract
Due to the lack of explicit word boundaries in Chinese and Japanese, and to some extent in Korean, an additional problem in IR in these languages is to determine the appropriate indexing units. For CLIR with these languages, we also need to determine translation units. Both words and n-grams of characters have been used in IR in these languages; however, only words have been used as translation units in previous studies. In this paper, we compare the use of words and n-grams for both monolingual and cross-lingual IR in these languages. Our experiments show that Chinese character n-grams are reasonable alternative indexing and translation units to words, and they lead to retrieval effectiveness comparable to or higher than words. For Japanese and Korean IR, bigrams or a combination of bigrams and unigrams produce the highest effectiveness.
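As an illustration of the character n-gram indexing units the abstract compares, here is a minimal sketch (not taken from the paper) of extracting overlapping character unigrams and bigrams from unsegmented CJK text; the function name and the sample phrase are our own for demonstration:

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a string.

    Whitespace is removed first, since CJK text carries no
    explicit word boundaries to preserve.
    """
    chars = "".join(text.split())
    if len(chars) < n:
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

# Example: index a short Chinese phrase ("information retrieval")
# as unigrams and as bigrams.
phrase = "信息检索"
print(char_ngrams(phrase, 1))  # ['信', '息', '检', '索']
print(char_ngrams(phrase, 2))  # ['信息', '息检', '检索']
```

Each bigram overlaps its neighbor by one character, so every two-character word in the text is guaranteed to appear as some indexing unit without requiring a word segmenter.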
Similar resources
The University of Amsterdam at NTCIR-5
We describe the University of Amsterdam’s participation in the Cross-Lingual Information Retrieval task at NTCIR-5. We focused on Chinese monolingual retrieval, and aimed to study the effectiveness of language models and different tokenization methods for Chinese. Our main findings are the following. First, where the vector space model excels on a bigram index, the language model performs poorl...
Phrasal Translation for English-Chinese Cross Language Information Retrieval
This paper introduces a simple and effective nonoverlapping unigram and bigram segmentation method for both monolingual Chinese and English-Chinese cross language retrieval. It also describes English-Chinese cross language retrieval experiments involving 54 topics and some 164,000 documents. The translation of English queries to Chinese is done using a Chinese-English dictionary of about 120,00...
Monolingual Experiments with Far-East Languages in NTCIR-6
This paper describes our third participation in an evaluation campaign involving the Chinese, Japanese and Korean languages (NTCIR-6). Our participation is motivated by three objectives: 1) study the retrieval performances of various probabilistic and language models for these languages; 2) compare the relative retrieval effectiveness of a combined “unigram & bigram” indexing scheme combined wi...
Cross Language Information Retrieval for Biomedical Literature
This workshop report discusses the collaborative work of UT, EMC and TNO on the TREC Genomics Track 2007. The biomedical information retrieval task is approached using cross language methods, in which biomedical concept detection is combined with effective IR based on unigram language models. Furthermore, a co-occurrence method is used to select and filter candidate answers. On its own, the cro...
Special Issue on Artificial Intelligence IJACSA Special Issue Guest Editor
The main aim of this study is to develop a part-of-speech tagger for the Afaan Oromo language. After reviewing the literature on Afaan Oromo grammar and identifying a tagset and word categories, the study adopted the Hidden Markov Model (HMM) approach and implemented unigram and bigram models with the Viterbi algorithm. The unigram model is used to understand word ambiguity in the language, while the bigram model is u...
Publication date: 2007